Visualizing NMDS Results from Taxonomic Abundance Tables

Sample Code-Through

Published

February 17, 2023

Script Author: Matt Selensky (github.com/mselensky)

Documented by: Emily Kelleher

Preamble

This script is for visualizing diversity and community similarity from data processed from 16S rRNA gene amplicon analysis. Specifically, this takes in the collapsed ASV table from QIIME2. Before NMDS analysis is performed, rarefying the data is critical.

Installing Packages and Loading Libraries

Libraries must be loaded at the beginning of each new R script. In this example script, the necessary libraries are tidyverse, bngal, parallel, viridisLite, and vegan. Whenever you are running a new library, you will need to install the package first, otherwise you will receive an error because your libraries won’t recognize the package. Once you install a package you won’t need to install it again, unless there is an update to R or to the package itself, which is why it’s important to occasionally check the R documentation for updates to the software. Here is a demonstration as to how to load in your libraries:

Code
library(tidyverse)
library(bngal)
library(parallel)
library(viridisLite)
library(vegan)

Importing Data

The next step is to load in your data sets. It is helpful to include a data folder within your project, in order to keep your data organized from your scripts and other files. For csv files, always use the read_csv() function. If dealing with an rda/rds file, use the load() function, and if using a txt file, you can use read_delim(). For any other file types, you can always look up which functions to use online. The data being loaded in this example is an “ASV” (Amplicon Sequence Variant) table and the metadata.

Code
asv_table <- read_csv("data/example-asv-table.csv", col_types = cols()) 
metadata <- read_csv("data/example-metadata.csv", col_types = cols())

Defining Functions

The following code defines two functions: get_nmds_scores and get_nmds_vectors. get_nmds_scores takes three inputs: metaMDS_output, which is the output from the vegan::metaMDS function which performs Nonmetric Multidimensional Scaling, metadata, which is a data frame with information on the samples used in the NMDS analysis, and joining_varibale. The scores() function extracts the NMDS scores, and the function returns a tibble with the NMDS scores from metaMDS_output, joined with the metadata using the joining_variable. The row names have now been converted to a separate column specified by the joining_variable argument. The other function, get_nmds_scores, first extracts the NMDS scores from env_output and then converts the sites into a data frame using the scores() function. The output is a tibble with row names converted to a separate column. It then extracts the p-values for the environmental variables from the env_output and joins them with the NMDS vectors based on the variable names. Finally, it renames the NMDS vector columns and filters the results based on the specified p-value threshold, returning only the top variables with the strongest correlations to the NMDS ordination.

Code
get_nmds_scores <- function(metaMDS_output, metadata, joining_variable) {
  scores_ <- scores(metaMDS_output) 
  
  scores_[["sites"]] %>%
    as.data.frame() %>%
    rownames_to_column(var = joining_variable) %>% as_tibble()
}
get_nmds_vectors <- function(env_output, top, pvalue) {
  vectors <- env_output %>% scores(display = "vectors") %>%
    as.data.frame() %>%
    rownames_to_column(var = "variable")
  pvals <- env_output$vectors$pvals %>%
    as.data.frame() %>%
    rename(., pval = `.`) %>%
    rownames_to_column(var = "variable")
  vectors <- left_join(vectors, pvals, by = "variable") %>%
    rename(NMDS1_vec = "NMDS1", NMDS2_vec = "NMDS2") %>%
    filter(pval < pvalue)
  vectors <- top_n(vectors, -(top))
  vectors
}

Taxonomic Level and Abundance

In this script, “asv” is assigned to the variable called tax.level in order to define which taxonomic level you want to plot. Next, a vector tax.levels is defined by the different taxonomic levels at which you want to bin the taxonomic abundance data. In this example, the levels are “phylum”, “class”, “order”, “family”, “genus”, and “asv”. An empty list, binned_tax, is then created. A for loop runs through the taxonomic levels in tax_levels, and at each one, the bngal::bin_taxonomy() function is called with the asv.table, meta.data, and tax.level arguments. This function takes as input a table of ASV abundances (asv_table), metadata about the samples (metadata), and a specified taxonomic level (i in this case) and groups the ASVs into taxonomic units at that level. The resulting data for each taxonomic level is stored in the binned_tax list using the taxonomic level as a key.

Code
tax.level = "asv"

tax_levels = c("phylum", "class", "order", "family", "genus", "asv")
binned_tax=list()
for (i in tax_levels) {
  binned_tax[[i]] <- bngal::bin_taxonomy(asv.table = asv_table, 
                                         meta.data = metadata, 
                                         tax.level = i)
}

Calculating Alpha and Beta Diversities

The next step is to calculate the Shannon and Beta diversities. Shannon diversity measures the diversity within a single sample, and beta diversity measures the diversity between samples. The bngal::get_alpha.div() function will calculate three different alpha diversity indices (Shannon, Simpson, and Inverse Simpson) at every taxonomic level. The ASV level from the binned taxonomic abundance data is stored in the binned_tax list, and the result is stored in a variable called alpha_diversity. Then, the alpha_diversity table is filtered to only include rows where the diversity index is “Shannon” and the taxonomic level is equal to the tax.level variable defined earlier. The shannon_div object includes the Shannon results for the defined taxonomic level, but could be modified so that index == "simpson" & tax_level == "family" for family-level Simpson results, for instance. However, it’s important to note that shannon_div is not included in the final plot_nmds.

Code
alpha_diversity <- bngal::get_alpha.div(binned_tax)
shannon_div <- alpha_diversity %>%
  filter(index == "shannon" & tax_level == tax.level)

Afterwards, the code will loop through the taxonomic levels in tax_levels, and for each level, create a relative abundance matrix and a presence-absence matrix. To create the relative abundance matrix, the code uses the bngal::make_matrix() function to convert the binned taxonomic abundance data for that taxonomic level into a matrix with samples as rows and taxonomic units as columns. The values in the matrix represent the relative abundance of each taxonomic unit in each sample.

To create the presence-absence matrix, the code starts with the relative abundance matrix created in the previous step and replaces all non-zero values with 1, effectively creating a matrix of presence (1) or absence (0) of each taxonomic unit in each sample. The relative abundance and presence-absence matrices are stored in separate lists called rel_abun_mat and pres_abs_mat.

Code
rel_abun_mat = list()
for (i in tax_levels) {
  rel_abun_mat[[i]] <- binned_tax[[i]] %>%
    bngal::make_matrix(count_column_name = "rel_abun_binned",
                       row_names = "sample-id")
}

pres_abs_mat = list()
for (i in tax_levels) {
  pres_abs_mat[[i]] = rel_abun_mat[[i]]
  pres_abs_mat[[i]][pres_abs_mat[[i]] > 0] <- 1
}

Creating NMDS Objects at each Taxonomic Level

Next, calculate Nonmetric Multidimensional Scaling (NMDS) using the Bray-Curtis distance metric on two types of matrices, relative abundance and presence-absence, at multiple taxonomic levels. The mcapply() function applies the vegan::metaMDS() function to each matrix in the rel_abun_mat and pres_abs_mat lists. For Windows users, specify that mc.cores = 1 in the mcapply wrapper. The vegan::metaMDS() function performs NMDS on a distance matrix and returns a set of coordinates that represent the samples. The k argument is set to 2 to produce two-dimensional coordinates. The mc.cores argument is specified to mcalapply() so that the vegan::metaMDS() function is applied to each matrix in parallel using multiple cores. These coordinates for each matrix are then stored in a new list called rel_abun_nmds for relative abundance matrices and pres_abs_nmds for presence-absence matrices. The log10() function is also used to transform the relative abundance matrices prior to performing NMDS. The resulting NMDS plots will be used to visualize the similarities and differences between the samples based on their taxonomic composition. The code also calculates the time taken to produce the NMDS plots for both types of matrices and displays them as a message.

Code
t0<-Sys.time()
rel_abun_nmds <- mclapply(rel_abun_mat,
                          FUN = function(i) {
                            #message(" | [", Sys.time(),"] --------- Calculating NMDS at the ", i, "-level... ---------")
                            vegan::metaMDS(log10(i+1), distance = "bray", k=2)
                          },
                          mc.cores = detectCores()-1)
t1<-Sys.time()
message(format(t1-t0), "required to produce NMDS for log10-transformed relative abundance matrices")
t0<-Sys.time()
pres_abs_nmds <- mclapply(pres_abs_mat,
                          FUN = function(i) {
                            #message(" | [", Sys.time(),"] --------- Calculating NMDS at the ", i, "-level... ---------")
                            vegan::metaMDS(i, distance = "bray", k=2)
                          },
                          mc.cores = detectCores()-1)
t1<-Sys.time()
message(format(t1-t0), "required to produce NMDS for binary matrices")

Extracting NMDS Scores

NMDS scores are extracted by creating two lists of scores, one for relative abundance data and one for presence-absence data, at different taxonomic levels. Using a for loop, each iteration assigns NMDS scores for the taxonomic level to the corresponding element in each of the lists, rel_abun_scores and pres_abs_scores. The get_nmds_scores() function is called with two arguments, a data object containing NMDS coordinates for that taxonomic level (rel_abun_nmds[[i]] and pres_abs_nmds[[i]]), and metadata information with the column “sample-id” to match the samples. The resulting NMDS scores are then renamed using dplyr::rename() function, adding a prefix with the taxonomic level and indicating whether it’s for relative abundance data (rel) or presence-absence data (pres).

Code
rel_abun_scores <- list()
pres_abs_scores <- list()
for (i in tax_levels) {
  rel_abun_scores[[i]] <- get_nmds_scores(rel_abun_nmds[[i]], metadata, "sample-id") %>%
    dplyr::rename(!!paste0("NMDS1_rel_", i) := NMDS1, !!paste0("NMDS2_rel_", i) := NMDS2)
  
  pres_abs_scores[[i]] <- get_nmds_scores(pres_abs_nmds[[i]], metadata, "sample-id") %>%
    dplyr::rename(!!paste0("NMDS1_pres_", i) := NMDS1, !!paste0("NMDS2_pres_", i) := NMDS2)
}

Merging NMDS Scores into a Single Object for Plotting

Before merging the NMDS scores, the message() function is called to print a message to the console, indicating the current time stamp (Sys.time()) and the phrase “Relative abundance NMDS scores joining by:”. The purrr::reduce() function is then used to reduce the list of NMDS scores to a single data frame by iteratively joining each element with the previous one. The NMDS scores are merged together by using the left_join() function to join each element in the list, merging the data frames based on matching sample IDs. After the join is completed, the resulting data frames are stored in rel_scores and pres_scores.

Code
message(" | [", Sys.time(), "] Relative abundance NMDS scores joining by:")
rel_scores <- purrr::reduce(rel_abun_scores, left_join)
pres_scores <- purrr::reduce(pres_abs_scores, left_join)

Creating Long-Form Plottable NMDS Scores

With the merged NMDS scores, we can now transform the object into a long-form data frame in order to plot it. The pivot_longer() function is used to convert the data frame from wide format to long format, with one row per sample and per NMDS axis. The cols argument specifies the columns to pivot, which in this case are all columns except for the first one (2:ncol(.)). The names_to and values_to arguments specify the names of the new columns created during the pivot operation, which are “NMDS” and “score”, respectively. The separate() function is used to split the “NMDS” column into three new columns based on the underscore separator, resulting in columns for “NMDS”, “type” (either “rel” or “pres”), and “tax_level”. The pivot_wider() function is then used to spread the “NMDS” column back into separate columns for each NMDS axis. The filter() function is used to remove rows where the “tax_level” column equals “domain”. The resulting long-form data frames, rel_scores_long and pres_scores_long, can now be used for visualization.

Code
rel_scores_long <- rel_scores %>%
  pivot_longer(cols = 2:ncol(.), names_to = "NMDS", values_to = "score") %>%
  separate(NMDS, into = c("NMDS", "type", "tax_level")) %>%
  pivot_wider(names_from = "NMDS", values_from = "score") %>%
  filter(tax_level != "domain")
pres_scores_long <- pres_scores %>%
  pivot_longer(cols = 2:ncol(.), names_to = "NMDS", values_to = "score") %>%
  separate(NMDS, into = c("NMDS", "type", "tax_level")) %>%
  pivot_wider(names_from = "NMDS", values_from = "score") %>%
  filter(tax_level != "domain")

Function that Plots NMDS Filled by Alpha Diversity Values

This function takes a data frame of NMDS scores in long format (scores.long), a variable to fill the points by (fill.var), metadata (meta.data), environmental data (env.long), an index to subset the data by (alpha.index), and a variable to divide the data by (alpha.div). First, it joins the NMDS scores data frame with the metadata data frame and environmental data frame if provided. It then creates a base plot with geom_point(), which is ggplot layer that creates a scatter plot, using NMDS1 as the x-axis and NMDS2 as the y-axis, and filled by the variable fill.var. The plot is facet-wrapped by taxonomic level, and the legend is customized. If environmental data is provided, the function adds geom_segment and geom_label_repel layers to show the relationships between the environmental variables and the NMDS scores. The function then checks if alpha.index is missing. If it is missing, it checks if the fill variable is numeric. If it is numeric, the function uses scale_fill_viridis_c to create a continuous color scale. If it is not numeric, the function creates a legend with shape = 21 (a filled circle). If alpha.index is provided, the function creates a continuous color scale with scale_fill_viridis_c using alpha.index as the label for the legend.

Code
plot_nmds <- function (scores.long, fill.var, meta.data, env.long, alpha.index, alpha.div) {
  
  if (!missing(alpha.div)) {
    scores.long.joined <- scores.long %>%
      left_join(., alpha.div, by = c("sample-id", "tax_level")) %>%
      left_join(., meta.data, by = "sample-id") %>%
      filter(index %in% alpha.index)
  } else {
    scores.long.joined <- scores.long %>%
      left_join(., meta.data, by = "sample-id") 
  }
  
  # if (missing(size.var) ) {
  #   scores.long.joined[[size.var]] = "size"
  # }
  base_plot <- ggplot() +
    geom_point(data = scores.long.joined,
               aes(NMDS1, NMDS2, 
                   fill = .data[[fill.var]]), 
               #shape = .data[[shape.var]], 
               #size = .data[[size.var]])#,
               size = 3,
               shape = 21) +
    facet_wrap(~factor(tax_level, tax_levels)) +
    theme(legend.title = element_text(size = 10),
          axis.title.x = element_blank(), axis.title.y = element_blank()) +
    theme_minimal() +
    guides(size = "none") +
    ggtitle("NMDS (Bray-Curtis)")
  
  if (missing(env.long)) {
    base_plot
  } else {
    base_plot <- base_plot +
      geom_segment(data = env.long, 
                   aes(x = 0, xend = NMDS1,
                       y = 0, yend = NMDS2, 
                       color = phylum)) +
      geom_label_repel(data = env.long, aes(x = NMDS1,
                                            y = NMDS2,
                                            color = phylum,
                                            label = tax3),
                       max.overlaps = 100, direction = "both", hjust = 0) +
      scale_color_manual(values = phylum_color_dict) +
      guides(color = "none")
  }
  
  if (missing(alpha.index)) {
    
    if (is.numeric(scores.long.joined[[fill.var]])) {
      base_plot +
        labs(fill = paste0(fill.var)) +
        scale_fill_viridis_c()
    } else {
      base_plot +
        labs(fill = paste0(fill.var)) +
        guides(fill = guide_legend(override.aes = list(shape = 21)))
    }
  } else {
    base_plot +
      labs(fill = paste0(alpha.index)) +
      scale_fill_viridis_c()
  } 
  
}

Viewing the Plots

In this example, 4 plots are generated in total, however the plot_nmds() function is written to be flexible based on what the user wants to fill their plot by. The first two generate NMDS plots for the pres_scores_long and rel_scores_long data frames with fill colors indicating Alpha diversity (Shannon index). The alpha diversity data is passed in the alpha.div parameter. The fill color variable is set to “value” and the metadata data frame is passed in the metadata parameter. The plots effectively model Alpha diversity (filled continuous scale) vs. Beta diversity (NMDS points) from the relative abundance/presence-absence matrices, which helps informs the basic understanding of community structure. In addition, you could also fill the plots by the Simpson and Inverse Simpson Alpha diversity indices as well to see how those different metrics compare to each other. The next two plots generate NMDS plots for the pres_scores_long and rel_scores_long data frames with fill colors indicating the region for the pres_scores_long plot and zone for the rel_scores_long plot. The fill color variable is set to “region” for pres_scores_long plot and “zone” for rel_scores_long plot. The metadata data frame is passed in the metadata parameter. the user needs to specify both the alpha.div and alpha.index arguments if they want to fill by alpha diversity data. alpha.div is the R object containing alpha diversity data, and alpha.index is one of “simpson”, “invsimpson”, or “shannon”. If users instead want to explore how the NMDS plots look filled by a metadata variable (e.g., “region” or “zone”), they can as well

Code
plot_nmds(pres_scores_long, fill.var = "value", metadata, alpha.index = "shannon", alpha.div = alpha_diversity)

Code
plot_nmds(rel_scores_long, fill.var = "value", metadata, alpha.index = "shannon", alpha.div = alpha_diversity)

Code
plot_nmds(pres_scores_long, fill.var = "region", metadata)

Code
plot_nmds(rel_scores_long, fill.var = "zone", metadata)